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We propose a new approach to analyze data that naturally lie 
on manifolds. We focus on a special class of manifolds, called direct 
product manifolds, whose intrinsic dimension could be very high. Our 
method finds a low-dimensional representation of the manifold that 
can be used to find and visualize the principal modes of variation 
of the data, as Principal Component Analysis (PCA) does in linear 
spaces. The proposed method improves upon earlier manifold ex- 
tensions of PCA by more concisely capturing important nonlinear 
modes. For the special case of data on a sphere, variation following 
nongeodesic arcs is captured in a single mode, compared to the two 
modes needed by previous methods. Several computational and sta- 
tistical challenges are resolved. The development on spheres forms 
the basis of principal arc analysis on more complicated manifolds. 
The benefits of the method are illustrated by a data example using 
medial representations in image analysis. 

1. Introduction. Principal Component Analysis (PCA) has been fre- 
quently used as a method of dimension reduction and data visualization 
for high-dimensional data. For data that naturally lie in a curved mani- 
fold, application of PCA is not straightforward since the sample space is 
not linear. Nevertheless, the need for PCA-like methods is growing as more 
manifold data sets are encountered and as the dimensions of the manifolds 
increase. 

In this article we introduce a new approach for an extension of PCA on a 
special class of manifold data. We focus on direct products of simple mani- 
folds, in particular, of the unit circle S 1 , the unit sphere S 2 , M+ and MP. We 
will refer to these as direct product manifolds, for convenience. Many types 
of statistical sample spaces are special cases of the direct product manifold. 
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A widely known example is the sample space for directional data [Fisher 
(1993), Fisher, Lewis and Embleton (1993) and Mardia and Jupp (2000)] 
and their direct products. Applications include analysis of wind directions, 
orientations of cracks, magnetic field directions and directions from the earth 
to celestial objects. For example, when we consider multiple 3-D directions 
simultaneously, the sample space is S 2 <S> • • • <8> S 2 , which is a direct product 
manifold. Another example is the medial representation of shapes [m-reps, 
Siddiqi and Pizer (2008)], which is somewhat less known to the statistical 
community but provides a powerful parametrization of 3-D shapes of hu- 
man organs and has been extensively studied in the image analysis field. 
The space of m-reps is usually a high-dimensional direct product manifold. 
Some background and necessary definitions on direct product manifolds can 
be found in Appendix. 

Our approach to a manifold version of PCA builds upon earlier work, espe- 
cially the principal geodesic analysis proposed by Fletcher et al. (2004) and 
the geodesic PCA proposed by Huckemann and Ziezold (2006) and Hucke- 
mann, Hotz and Munk (2010). A detailed catalogue of current methodologies 
can be found in Huckemann, Hotz and Munk (2010). An important approach 
among these is to approximate the manifold by a linear space. Fletcher et al. 
(2004) take the tangent space of the manifold at the geodesic mean as the 
linear space, and work with appropriate mappings between the manifold and 
the tangent space. This results in finding the best fitting geodesies among 
those passing through the geodesic mean. This was improved in an impor- 
tant way by Huckemann, Hotz and Munk, who found the best fit over the 
set of all geodesies. Huckemann, Hotz and Munk went on to propose a new 
notion of center point, the PCmean, which is an intersection of the first two 
principal geodesies. This approach gives significant advantages, especially 
when the curvature of the manifold makes the geodesic mean inadequate, 
an example of which is depicted in Figure 2b. 

Our method inherits advantages of these methods and improves further 
by effectively capturing more complex nongeodesic modes of variation. Note 
that the curvature of direct product manifolds is mainly due to the spherical 
part, which motivates careful investigation of S 2 -valued variables. We point 
out that (small) circles in S 2 , including geodesies, can be used to capture 
the nongeodesic variation. We introduce the principal circles and principal 
circle mean, analogous to, yet more flexible than, the geodesic principal 
component and PCmean of Huckemann, Hotz and Munk. These become 
principal arcs when the manifold is indeed S 2 . For more complex direct 
product manifolds, we suggest transforming the data points in S 2 into a 
linear space by a special mapping utilizing the principal circles. For the 
other components of the manifold, the tangent space mappings can be used 
to map the data into a linear space as done in Fletcher et al. (2004). Once 
manifold-valued data are mapped onto the linear space, then the classical 
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Fig. 1. S 2 -valued samples (n = 60) of the prostate m-reps with 15 medial atoms. One 
sample in this figure is represented by a 30-tuple of 3-D directions (two directions at each 
atom), which lies in the manifold (&]1 1 (S 2 ® S 2 ). Small and great circles are fitted and 
plotted in the rightmost atoms to emphasize the sample variation along small circles. 

linear PCA can be applied to find principal components in the transformed 
linear space. The estimated principal components in the linear space can be 
back-transformed to the manifold, which leads to principal arcs. 

We illustrate the potential of our method by an example of m-rep data 
in Figure 1. Here, m-reps with 15 sample points called atoms model the 
prostate gland (an organ in the male reproductive system) and come from 
the simulator developed and analyzed in Jeong et al. (2008). Figure 1 shows 
that the S 2 components of the data tend to be distributed along small circles, 
which frequently are not geodesies. We emphasize the curvature of variation 
along each sphere by fitting a great circle and a small circle (by the method 
discussed in Section 2). Our method is adapted to capture this nonlinear 
(nongeodesic) variation of the data. A potential application of our method 
is to improve accuracy of segmentation of objects from CT images. Detailed 
description of the data and results of our analysis can be found in Section 5. 

Note that the previous approaches [Fletcher et al. (2004), Huckemann and 
Ziezold (2006)] are defined for general manifolds, while our method focuses 
on these particular direct product manifolds. Although the method is not 
applicable for general manifolds, it is useful for this common class of mani- 
folds that is often found in applications. Our results inform our belief that 
focusing on specific types of manifolds allow more precise and informative 
statistical modeling than methods that attempt to be fully universal. This 
happens through using special properties (e.g., presence of small circles) that 
are not available for all other manifolds. 

The rest of the article is organized as follows. We begin by introducing a 
circle class on S 2 as an alternative to the set of geodesies. Section 2 discusses 
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principal circles in S 2 , which will be the basis of the special transformation. 
The first principal circle is defined by the least-squares circle, minimizing the 
sum of squared residuals. In Section 3 we introduce a data-driven method 
to decide whether the least-squares circle is appropriate. A recipe for prin- 
cipal arc analysis on direct product manifolds is proposed in Section 4 with 
discussion on the transformations. A detailed introduction of the space of 
m-reps and the results from applying the proposed method follow. A novel 
computational algorithm for the least-squares circles is presented in Sec- 
tion 6. In the Appendix we provide some necessary background for treating 
direct product manifolds as sample spaces, including the notion of geodesic 
mean, tangent space, exponential map and log map. 

2. Circle class for nongeodesic variation on S 2 . Consider a set of points 
in M 2 . Numerous methods for understanding population properties of a data 
set in linear space have been proposed and successfully applied, which in- 
clude rigid methods, such as linear regression and principal components, and 
very flexible methods, such as scatterplot smoothing and principal curves 
[Hastie and Stuetzle (1989)]. In this paper we make use of a parametric class 
of circles, including small and great circles, which allows much more flexibil- 
ity than either methods of Fletcher et al. (2004) or Huckemann, Hotz and 
Munk (2010), but less flexibility than a principal curve approach. Although 
this idea was motivated by examples such as those in Figure 1, there are 
more advantages gained from using the class of circles: 

(i) The circle class includes the simple geodesic case. 

(ii) Each circle can be parameterized, which leads to an easy interpretation. 

(iii) There is an orthogonal complement of each circle, which gives two im- 
portant advantages: 

(a) Two orthogonal circles can be used as a basis of a further extension 
to principal arc analysis. 

(b) Building a sensible notion of principal components on S 2 alone is 
easily done by utilizing the circles. 

The idea (iii)(b) will be discussed in detail after introducing a method 
of circle fitting. Some notations follow: S 2 can be thought of as the unit 
sphere in M 3 , so that a unit vector x E R 3 is a member of S 2 . The geodesic 
distance between x, y € S 2 , denoted by p(x, y), is defined as the shortest 
distance between x, y along the sphere, which is the same as the angle 
formed by the two vectors. Thus, /?(x,y) =cos _1 (x'y). A circle on S 2 is 
conveniently parameterized by center c € S 2 and geodesic radius r, and 
denoted by 5(c, r) = {x E S 2 \p(c,s) = r}. It is a geodesic when r = tt/2. 
Otherwise it is a small circle. 

A circle that best fits the points xi , . . . , x n € S 2 is found by minimizing 
the sum of squared residuals. The residual of Xj is defined as the signed 
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geodesic distance from Xj to the circle S(c,r). Then the least-squares circle 
is obtained by 

n 

(1) miny^(p(xj, c) — r) 2 subject to c E S 2 , r E (0, tt). 

c,r — j 

i=l 

Note that there are always multiple solutions of (1). In particular, whenever 
(c, r) is a solution, (— c,tt — r) also solves the problem as 5(c,r) = 5(—c,tt — 
r). This ambiguity does not affect any essential result in this paper. Our 
convention is to use the circle with smaller geodesic radius. 

The optimization task (1) is a constrained nonlinear least squares prob- 
lem. We propose an algorithm to solve the problem that features a simplified 
optimization task and approximation of S 2 by tangent planes. The algorithm 
works in a doubly iterative fashion, which has been shown by experience to 
be stable and fast. Section 6 contains a detailed illustration of the algorithm. 

Analogous to principal geodesies in S 2 , we can define principal circles in 
S 2 by utilizing the least-squares circle. The principal circles are two orthog- 
onal circles in S 2 that best fit the data. We require the first principal circle 
to minimize the variance of the residuals, so it is the least-squares circle (1). 
The second principal circle is a geodesic which passes through the center of 
the first circle and thus is orthogonal at the points of intersection. More- 
over, the second principal circle is chosen so that one intersection point is 
the intrinsic mean [defined in (2) later] of the projections of the data onto 
the first principal circle. 

Based on a belief that the intrinsic (or extrinsic) mean defined on a curved 
manifold may not be a useful notion of the center point of the data [see, e.g., 
Huckemann, Hotz and Munk and Figure 2b], the principal circles do not use 
the pre-determined means. To develop a better notion of center point, we 
locate the best 0-dimensional representation of the data in a data-driven 
manner. Inspired by the PCmean idea of Huckemann, Hotz and Munk, given 
the first principal circle 5\ , the principal circle mean u E 5\ is defined (in an 
intrinsic way) as 

n 

(2) u = argmin ^ p 2 (u, P Sl x,), 

where .P^x is the projection of x onto 6\, that is, the point on <5i of the 
shortest geodesic distance to x. Then 

_ xsin(r) + csin(p(x, c) - r) 
v ' ; sm(p(x,cj) 

as in equation (3.3) of Mardia and Gadsden (1977). We assume that c is the 
north pole e3, without losing generality since otherwise the sphere can be 
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rotated. Then 



(4) 



p(u,P Sl x)=sm(r)p s i 



(ui,u 2 ) (xi,x 2 ) \ 



where u = (ui,u 2 , U3)' , x = (x\, x 2 , £3)' and pgi is the geodesic (angular) dis- 
tance function on S 1 . The optimization problem (2) is equivalent to finding 
the geodesic mean in S . See equation (17) in the Appendix for computation 
of the geodesic mean in S l . 

The second principal circle 5 2 is then the geodesic passing through the 
principal circle mean u and the center c of <5i. Denote 5 = <5(xi, . . . , x n ) as 
a combined representation of (<5i,u) or, equivalently, (Si,5 2 ). 

As a special case, we can force the principal circles to be great circles. 
The best fitting geodesic is obtained as a solution of the problem (1) with 
r = 7r/2 and becomes the first principal circle. The optimization algorithm 
for this case is slightly modified from the original algorithm for the varying 
r case by simply setting r = ir/2. The principal circle mean u and the 5 2 for 
this case are defined in the same way as in the small circle case. Note that 
the principal circles with r = ir/2 are essentially the same as the method of 
Huckemann and Ziezold (2006). 

Figure 2 illustrates the advantages of using the circle class to efficiently 
summarize variation. On four different sets of toy data, the first principal 
circle 5± is plotted with principal circle mean u. The first principal geodesies 
from the methods of Fletcher and Huckemann are also plotted with their cor- 
responding mean. Figure 2a illustrates the case where the data were indeed 
stretched along a geodesic. The solutions from the three methods are similar 
to one another. The advantage of Huckemann's method over Fletcher's can 
be found in Figure 2b. The geodesic mean is found far from the data, which 
leads to poor performance of the principal geodesic analysis, because it con- 
siders only great circles passing through the geodesic mean. Meanwhile, the 
principal circle and Huckemann's method, which do not utilize the geodesic 
mean, work well. The case where geodesic mean and any geodesic do not 
fit the data well is illustrated in Figure 2c, which is analogous to the Eu- 
clidean case, where a nonlinear fitting may do a better job of capturing the 
variation than PCA. To this data set, the principal circle fits best, and our 
definition of mean is more sensible than the geodesic mean and the PCmean. 
The points in Figure 2d are generated from the von Mises-Fisher distribu- 
tion with k = 10, thus having no principal mode of variation. In this case 
the first principal circle 5\ follows a contour of the apparent density of the 
points. We shall discuss this phenomenon in detail in the following section. 

Fitting a (small) circle to data on a sphere has been investigated for some 
time, especially in statistical applications in geology. Those approaches can 
be distinguished in three different ways, where our choice fits into the first 
category: 




(c) (d) 

Fig. 2. Toy examples on S 2 with n = 30 points showing the first principal circle (red) as 
a small circle and the first geodesic principal component (dotted green) by Huckemann, and 
the first principal geodesic (black) by Fletcher. Also plotted are the geodesic mean, PCmean 
and principal circle mean of the data as black, green and red diamonds, respectively, (a) 
The three methods give similar satisfactory answers when the data are stretched along 
a geodesic, (b) When the data are stretched along a great circle, covering almost all of 
it, the principal geodesic (black circle) and geodesic mean (black diamond) fail to find a 
reasonable representation of the data, while the principal circle and Huckemann's geodesic 
give sensible answers, (c) Only the principal circle fits well when the data are not along a 
geodesic, (d) For a small cluster without principal modes of variation, the principal circle 
gets too small. See Section 3 for discussion of this phenomenon. 



(1) Least-squares of intrinsic residuals: Gray, Geiser and Geiser (1980) for- 
mulated the same problem as in (1), finding a circle that minimizes sum 
of squared residuals, where residuals are defined in a geodesic sense. 

(2) Least-squares of extrinsic residuals: A different measure of residual was 
chosen by Mardia and Gadsden (1977) and Rivest (1999), where the 
residual of x from <5(c, r) is defined by the shortest Euclidean distance 
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between x and 5(c,r). Their objective is to find 



n n 



n 



argmin ||xj — P<5X,|| 2 = argmin — x^P^Xj 

6 T^i 5 



argmin N — cos (£j ) 
6 U 



where £j denotes the intrinsic residual. This type of approach can be 
numerically close to the intrinsic method as cos(£j) = 1 — £f/2 + 0{^f). 
(3) Distributional approach: Mardia and Gadsden (1977) and Bingham and 
Mardia (1978) proposed appropriate distributions to model 5 2 -valued 
data that cluster near a small circle. These models essentially depend 
on the quantity cos(£), which is easily interpreted in the extrinsic sense 
but not in the intrinsic sense. 

Remark. The principal circle and principal circle mean always exist. 
This is because the objective function (1) is a continuous function of c, with 
the compact domain S 2 . The minimizer r has a closed- form solution (see 
Section 6). A similar argument can be made for the existence of u. On the 
other hand, the uniqueness of the solution is not guaranteed. We conjecture 
that if the manifold is approximately linear or, equivalently, the data set is 
well approximated by a linear space, then the principal circle will be unique. 
However, this does not lead to the uniqueness of u, whose sufficient condition 
is that the projected data on 5\ is strictly contained in a half-circle [Karcher 
(1977)]. Note that a sufficient condition for the uniqueness of the principal 
circle is not clear even in the Euclidean case [Chernov (2010)]. 

3. Suppressing small least-squares circles. When the first principal circle 
8\ has a small radius, sometimes it is observed that 5\ does not fit the data 
in a manner that gives useful decomposition, as shown in Figure 2d. This 
phenomenon has been also observed for the related principal curve fitting 
method of Hastie and Stuetzle (1989). We view this as unwanted overfitting, 
which is indeed a side effect caused by using the full class of circles with free 
radius parameter instead a class of great circles. In this section a data-driven 
method to flag this overfitting is discussed. In essence, the fitted small circle 
is replaced by the best fitting geodesies when the data do not cluster along 
the circle but instead tend to cluster near the center of the circle. 

We first formulate the problem and solution in R 2 . This is for the sake of 
clear presentation and also because the result on R 2 can be easily extended 
to S 2 using a tangent plane approximation. 

Let fx. be a spherically symmetric density function of a continuous dis- 
tribution defined on R 2 . Whether the density is high along some circle 
is of interest. By the symmetry assumption, density height along a circle 
can be found by inspecting a section of /x along a ray from the origin 
(the point of symmetry). A section of /x coincides with the conditional 
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density fx!\x 2 ( x i\ x 2 = 0) = K _1 /x(^ij 0). A random variable corresponding 
to the p.d.f. /xi 1X2=0 * s no ^ directly observable. Instead, the radial dis- 
tance R = ||X|| from the origin can be observed. For the polar coordinates 
(R, 0) such that X= (Xi,X2) = (Rcos@,RsmQ), the marginal p.d.f. of 
R is fn(r) = 27rr/x(V, 0) as /x is spherically symmetric. A section of /x 
is related to the observable density fu as /r(t") oc r/x-JXa^M) f° r ^ > 0. 
This relation is called the length-biased sampling problem [Cox (1969)]. The 
relation can be understood intuitively by observing that a value r of R can 
be observed at any point on a circle of radius r, circumference of which is 
proportional to r. Thus, sampling of R from the density /xi 1x2=0 ^ s propor- 
tional to its size. 

The problem of suppressing a small circle can be paraphrased as "how 
to determine whether a nonzero point is a mode of the function /xi 1X2=0 > 
when observing only a length-biased sample." 

The spectrum from the circle-clustered case (mode at a nonzero point) to 
the center-clustered case (mode at origin) can be modeled as 

(5) data = signal + error, 

where the signal is along a circle with radius fi, and the error accounts for 
the perpendicular deviation from the circle (see Figure 3). Then, in polar 
coordinates (R, 0), is uniformly distributed on (0,27r] and R is a positive 
random variable with mean fj,. First assume that R follows a truncated 
Normal distribution with standard deviation a, with the marginal p.d.f. 
proportional to 

(6) f R ( r ) X< f>(^J±\ forr>0, 

where eft is the standard Normal density function. The conditional density 
/Xi |x 2 =o is th en 

1 1 / (r-fj 1 ) 2 \ 
fx 1 \x 2 =o( r ) oc -fn(r) oc — exp = — for r > 0. 

Nonzero local extrema of /xi 1x2=0 can De characterized as a function of 
(/j,, a) in terms of r + , r_ = {/i ± \/(^~ 2<r)(// + 2a)}/2 as follows: 



• When /U > 2cr, /xi 1X3=0 has a local maximum at r + , minimum at r_. 

• When /i = 2a, r + = r_ = 

• When /i < 2<T, /xi|x 2 =0 is strictly decreasing, for r > 0. 

Therefore, whenever the ratio fj,/a > 2, /xi |Xa=o has a mode at r + . 

This idea can be applied for circles in S 2 with some modification, shown 
next. We point out that the model (5) is useful for understanding the small 
circle fitting: signal as a circle with radius fi, and error as the deviation 
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Fig. 3. Illustration of the conceptual model (5) on R 2 , which can also be understood as 
a local approximation of S . The signal is along the circle centered at c and radius fi . The 
error is perpendicular to the signal. When the deviation a is large, it is possible that the 
amount of error is even greater than the radius [i. This is incorporated in the wrapping 
approach (7). 

along geodesies perpendicular to the circle. Moreover, a spherically sym- 
metric distribution centered at c on S 2 can be mapped to a spherically 
symmetric distribution on the tangent space at c, preserving the radial dis- 
tances by the log map (defined in the Appendix). A modification needs to 
be made on the truncated density It is more natural to let the error 
be so large that the deviation from the great circle is greater than [i. Then 
the observed value may be found near the opposite side of the true signal, 
which is illustrated in Figure 3 as the large deviation case. To incorporate 
this case, we consider a wrapping approach. The distribution of errors (on 
the real line) is wrapped around the sphere along a great circle through c, 
and the marginal p.d.f. fn in (6) is modified to 

oo 

(7) m(r)<xJ2 

k=0 

The corresponding conditional p.d.f., 1x^X2=0^ ls s i muar to fx 1 \x 2 =Q an< ^ 
a numerical calculation shows that fx 1 \X2=0 ^ as a moa ^ e a ^ some nonzero 
point whenever fj,/a > 2.0534, for [i < tt/2. In other words, we use the small 
circle when fi/a is large. Note that in what follows we only consider the first 
term (k = 0) of (7) since other terms are negligible in most situations. We 
have plotted 1x^X2=0 ^ or some selected values of /i and a in Figure 4. 

With a data set on S 2 , we need to estimate \i and a, or the ratio fi/a. Let 
xi,...,x n £ and let c be the samples and the center of the fitted circle, 
respectively. Denote for the errors of the model (5) such that £j ~ N(0, a 2 ). 
Then n = p(xj, c) = which has the folded normal distribution [Leone, 

Nelson and Nottingham (1961)]. Estimation of fi and a based on unsigned rj 
is not straightforward. We present two different approaches to this problem. 



r + 2ixk — [i 



a 



+ 



2irk + n 



a 



for r € [0, ir]. 
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hAj = 2 



t»/a = 3 



a. 

5_x 
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0.2 0.4 0.6 0.8 
r: radial distance from center 



ji fa = 1 
]xh = 2 
p. fa = 3 
£i /a = 4 



n 0.5 




X 




Fig. 4. (Left) Graph of fx 1 \x 2 =o( r ) f or m/c = 1, 2, 3, 4. 27ie density is high at a nonzero 
point when fi/a > 2.0534. (Center, right) Spherically symmetric distributions f corre- 
sponding to fi/a = 2,3. TTie ratio fi/o~ > 2 roughly leads to a high density along a circle. 

Robust approach. The observations ri,...,r n can be thought of as a set of 
positive numbers contaminated by the folded negative numbers. Therefore, 
the left half (near zero) of the data are more contaminated than the right 
half. We only use the right half of the data, which are less contaminated 
than the other half. We propose to estimate \i and a by 



where Q3( < I ) ) is the third quantile of the standard normal distribution. The 
ratio can be estimated by p,/a. 

Likelihood approach via EM algorithm. The problem may also be solved 
by a likelihood approach. Early solutions can be found in Leone, Nelson 
and Nottingham (1961), Elandt (1961) and Johnson (1962), in which the 
MLEs were given by numerically solving nonlinear equations based on the 
sample moments. As those methods were very complicated, we present a 
simpler approach based on the EM algorithm. Consider unobserved binary 
variables Sj with values —1 and +1 so that s^rj ~ N(fi,a 2 ). The idea of the 
EM algorithm is that if we have observed Sj, then the maximum likelihood 
estimator of i? = (/i,rj 2 ) would be easily obtained. The EM algorithm is an 
iterative algorithm consisting of two steps. Suppose that the A;th iteration 
produced an estimate of The E-step is to impute Sj based on r^ and 

by forming a conditional expectation of log-likelihood for 



(8) 



[L = med(r") 



a = (Q3K)-med(r?))/Q 3 ($), 



n 



Q(#) = E log JJ nA 



i=l 



n 



= ^[log/(r 4 |s 4 = +l,tf)P( ai = +l\nA) 



i=l 
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+ log/ (r^; = -M)P(si = -i\n,S k )} 

n 

= Y^^^^Pm +log0(-ri|i?)(l -Pi(fc))], 
i=l 

where / is understood as an appropriate density function, and p^k) i s easily 
computed as 

of ni 3 \ <HrMk) 
Pi(k) = P[Si = +1 In, 0k) = - . § J; r§-r ■ 

The M-step is to maximize whose solution becomes the next estimator 
Now the (fc + l)th estimates are calculated by a simple differentiation 
and given by 

Y n 1 n 

i=l i=l 

With the sample mean and variance of r\ , . . . , r n as an initial estimator 
$0) the algorithm iterates E-steps and M-steps until the iteration changes 
the estimates less than a predefined criteria (e.g., 10 -10 ). fj,/a is estimated 
by the ratio of the solutions. 

Comparison. Performance of these estimators are now examined by a 
simulation study. Normal random samples are generated with ratios fx/a 
being 0, 1, 2 or 3, representing the transition from the center-clustered to 
circle-clustered case. For each ratio, n = 50 samples are generated, from 
which jl/a is estimated. These steps are repeated 1000 times to obtain the 
sampling variation of the estimates. We also study the n = 1000 case in order 
to investigate the consistency of the estimators. The results are summarized 
in Figure 5 and Table 1. 

The distribution of estimators are shown for n = 50, 1000 in Figure 5 
and the proportion of estimators greater than 2 is summarized in Table 1. 
When n = 1000, both estimators are good in terms of the proportion of 

Table 1 

Proportion of estimates greater than 2 from the data illustrated in Figure 5. For puja — 3, 
shown are proportions of correct answers from each estimator. For fj,/o~ = 1 or 0, shown 
are proportions of incorrect answers 



Method 


fi/a = 3 


p./cr = 2 


fi/a = 1 


fx/er = 


MLE, n = 50 


98.5 


55.2 


5.2 


6.8 


Robust, n = 50 


95.0 


50.5 


4.7 


1.4 


MLE, n = 1000 


100 


51.9 








Robust, n = 1000 


100 


50.5 
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Fig. 5. Simulation results of the proposed estimators for the ratio jx/a . Different ratios 
represent different underlying distributions. For example, estimators in the top left are 
based on random samples from a folded Normal distribution with mean \i = 3, standard 
deviation a = 1. Curves are smooth histograms of estimates from 1000 repetitions. The 
thick black curve represents the distribution of the robust estimator from n — 50 samples. 
Likewise, the thick red curve is for the MLE with n = 50, the dotted black curve is for the 
robust estimator with n — 1000, and the dotted red curve is for the MLE with n — 1000. 
The smaller sample size represents a usual data analytic situation, while the n = 1000 case 
shows an asymptotic situation. 



correct answers. In the following, the proportions of correct answers are 
corresponding to n = 50 case. The top left panel in Figure 5 illustrates 
the circle-centered case with ratio 3. The estimated ratios from the robust 
approach give correct solutions (greater than 2) 95% of the time (98.5% 
for likelihood approach). For the borderline case (ratio 2, top right), the 
small circle will be used about half the time. The center-clustered case is 
demonstrated with the true ratio 1, that also gives a reasonable answer 
(proportion of correct answers 95.3% and 94.8% for the robust and likelihood 
answers respectively). It can be observed that when the true ratio is zero, 
the robust estimates are far from (the bottom right in Figure 5). However, 
this is expected to occur because the proportion of uncontaminated data is 
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low when the ratio is too small. However, those 'inaccurate' estimates are 
around 1 and less than 2 most of the time, which leads to 'correct' answers. 
The likelihood approach looks somewhat better with more hits near zero, 
but an asymptotic study [Johnson (1962)] showed that the variance of the 
maximum likelihood estimator converges to infinity when the ratio tends to 
zero, as glimpsed in the long right tail of the simulated distribution. 

In summary, we recommend use of the robust estimators (8), which are 
computationally light, straightforward and stable for all cases. 

In addition, we point out that Gray, Geiser and Geiser (1980) and Rivest 
(1999) proposed to use a goodness-of-fit statistic to test whether the small 
circle fit is better than a geodesic fit. Let r g and r c be the sums of squares 
of the residuals from great and small circle fits. They claimed that V = 
(n — 3)(r g — r c )/r c is approximately distributed as i*i in -3 f° r a large n if the 
great circle was true. However, this test does not detect the case depicted in 
Figure 2d. The following numerical example shows the distinction between 
our approach and the goodness-of-fit approach. 

Example 1 . Consider the sets of data depicted in Figure 2. The goodness- 
of-fit test gives p-values of 0.51, 0.11359, and 0.0008 for (a)-(d), respec- 
tively. The estimated ratios fx/a are 14.92,16.89, 14.52 and 1.55. Note that 
for (d), when the least-squares circle is too small, our method suggests to 
use a geodesic fit over a small circle while the goodness-of-fit test gives sig- 
nificance of the small circle. The goodness-of-fit method is not adequate to 
suppress the overfitting small circle in a way we desire. 

A referee pointed out that the transition of the principal circle between 
great circle and small circle is not continuous. Specifically, when the data set 
is perturbed so that the principal circle becomes too small, then the principal 
circle and principal circle mean are abruptly replaced by a great circle and 
geodesic mean. As an example, we have generated a toy data set spread along 
a circle with some radial perturbation. The perturbation is continuously 
inflated, so that with large inflation, the data are no longer circle-clustered. 
In Figure 6 the [ija changes smoothly, but once the estimate hits 2 (our 
criterion), there is a sharp transition between small and great circles. Sharp 
transitions do naturally occur in the statistics of manifold data. For example, 
even the simple geodesic mean can exhibit a major discontinuous transition 
resulting from an arbitrarily small perturbation of the data. However, the 
discontinuity between small and great circles does seem more arbitrary and 
thus may be worth addressing. An interesting open problem is to develop a 
blended version of our two solutions, for values of /i/<r near 2, which could 
be done by fitting circles with radii that are smoothly blended between the 
small circle radius and tt/2. 
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inflation = 2.1, yJa- 2.086 inflation = 2.2, pfow 1.927 




Fig. 6. (Left) The estimate n/a decreases smoothly as the perturbation is inflated. (Cen- 
ter, right) Snapshots of the toy data on a sphere. A very small perturbation of the data set 
leads to a sharp transition between the small circle (center) and great circle (right). 

4. Principal arc analysis on direct product manifolds. The discussions 
of the principal circles in S 2 play an important role in defining the principal 
arcs for data in a direct product manifold M = M\ ® M2 <8> • • • <8> Md, where 
each Mi is one of the simple manifolds S , S 2 , R+ and R. We emphasize 
again that the curvature of the direct product manifold M is mainly due to 
the spherical components. 

Consider a data set xi, . ■ ■ ,x n € M, where Xi = (xj, . . . , xf) such that x\ £ 
My Denote do>d for the intrinsic dimension of M. The geodesic mean x of 
the data is defined component- wise for each simple manifold Mj. Similarly, 
the tangent plane at x, T S M, is also defined marginally, that is, T S M is 
a direct product of tangent spaces of the simple manifolds. This tangent 
space gives a way of applying Euclidean space-based statistical methods by 
mapping the data onto T S M. We can manipulate this approximation of the 
data component- wise. In particular, the marginal data on the S 2 components 
can be represented in a linear space by a transformation h$, depending on 
the principal circles, that differs from the tangent space approximation. 

Since the principal circles 5 capture the nongeodesic directions of varia- 
tion, we use the principal circles clS £1X6 S. which can be thought of as flat- 
tening the quadratic form of variation. In principle, we require a mapping 
hg:S 2 —¥ M 2 to have the following properties: For 5 = (61,62) = ((<5i(c,r),u): 

• u is mapped to the origin, 

• 6\ is mapped onto the x-axis, and 

• 62 is mapped onto the y-axis. 

Two reasonable choices of the mapping h$ will be discussed in Section 4.1 
in detail. 

The mapping h$ and the tangent space projection together give a linear 
space representation of the data where the Euclidean PCA is applicable. The 
line segments corresponding to the sample principal component direction of 



16 



S. JUNG, M. FOSKEY AND J. S. MARRON 



the transformed data can be mapped back to M, and become the principal 
arcs. 

A procedure for principal arc analysis is as follows: 

(1) For each j such that Mj is S 2 , compute principal circles 5 = 5(x\, . . . , x^) 

and the ratio fi/cr. If the ratio is greater than the predetermined value 
e = 2, then 5 is adjusted to be great circles as explained in Section 2. 

(2) Let h : M — > R do be a transformation h(x) = . . . , hd(x d )). Each 
component of h is defined as 

h . {x j ) = {h- s {x j ) . for M j = S 2 , 
3 I^Log^x- 7 ) otherwise, 

where Log^j and are defined in the Appendix and Section 4.1, re- 
spectively. 

(3) Observe that h(xi), . . . , h(x n ) € M. d ° always have their mean at the ori- 
gin. Thus, the singular value decomposition of the do x n data ma- 
trix X = [h(a;i) • • • h(x n )] can be used for computation of the PCA. Let 
Vi,V2, ...,v m be the left singular vectors of X corresponding to the 
largest m singular values. 

(4) The kth principal arc is obtained by mapping the direction vectors 
onto M by the inverse of h, which can be computed component-wise. 

The principal arcs on M are not, in general, geodesies. Nor are they 
necessarily circles, in the original marginal S 2 . This is because h$ and its 
inverse 1 are nonlinear transformations and, thus, a line on R 2 may not be 
mapped to a circle in S 2 . This is consistent with the fact that the principal 
components on a subset of variables are different from projections of the 
principal components from the whole variables. 

Principal arc analysis for data on direct product manifolds often results 
in a concise summary of the data. When we observe a significant variation 
along a small circle of a marginal S 2 , that is most likely not a random 
artifact but, instead, the result of a signal driving the circular variation. 
Nongeodesic variation of this type is well captured by our method. 

Principal arcs can be used to reduce the intrinsic dimensionality of M. 
Suppose we want to reduce the dimension by k, where k can be chosen 
by inspection of the scree plot. Then each data point x is projected to a 
A;-dimensional submanifold Mq of M in such a way that 

where the Vj's are the principal direction vectors in ]R rf °, found by step 3 
above. Moreover, the manifold Mq can be parameterized by the k principal 
components zi,...,Zj. such that Mq(z\, . . . , z^) = h~ 1 (^(° =1 ZjVj). 
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4.1. Choice of the transformation hg. The transformation hg : S 2 — > M 2 
leads to an alternative representation of the data, which differs from the 
tangent space projection. The hg transforms nongeodesic scatters along 5± 
to scatters along the x-axis, which makes a linear method like the PCA ap- 
plicable. Among many choices of transformations that satisfy the three prin- 
ciples we stated, two methods are discussed here. Recall that 6= (61,82) = 
(<5i(c,r),u). 

Projection. The first approach is based on the projection of x onto S\, 
defined in (3), and a residual £. The signed distance from u to Ps 1 x, whose 
unsigned version is defined in (4), becomes the x-coordinate, while the resid- 
ual £ becomes the y-coordinate. This approach has the same spirit as the 
model for the circle class (5), since the direction of the signal is mapped to 
the x-axis, with the perpendicular axis for errors. 

The projection hg(x) that we define here is closely related to the spherical 
coordinate system. Assume c = e3, and u is at the Prime meridian (i.e., 
on the x — z plane). For x and its spherical coordinates (<j),0) such that 
x = (xi,X2,xs) = (cos</>sin#,cos(/>sin#,cos#), 



where 6 U = cos -1 (u3) is the latitude of u. The set of hgfa) has mean zero 
because the principal circle mean u has been subtracted. 

Conformal map. A conformal map is a function which preserves angles. We 
point out two conformal maps that can be combined to serve our purpose. 
See Chapter 9 of Churchill and Brown (1984) and Krantz (1999) for detailed 
discussions of conformal maps. A conformal map is usually defined in terms 
of complex numbers. Denote the extended complex plane C U {co} as C*. 
Let (j) c : S 2 C* be the stereographic projection of the unit sphere when 
the point antipodal from c is the projection point. Then (f> c is a bijective 
conformal mapping defined for all S 2 that maps 5\ as a circle centered at 
the origin in C*. The linear fractional transformation, sometimes called the 
Mobius transformation, is a rational function of complex numbers, that can 
be used to map a circle to a line in C*. In particular, we define a linear 
fractional transformation f u * : C* — > C* as 



where u* = </> c (u), and a is a constant scalar. Then the image of 6\ under 
fu* ° 4>c is the real axis, while the image of 62 is the imaginary axis. The map- 
ping hg:S 2 ^ M? is defined by f u * o <fi c with the resulting complex numbers 
understood as members of M. 2 . Note that orthogonality of any two curves 
in S 2 is preserved by the hg but the distances are not. Thus, we use the 



(9) 




(10) 




if z —u* 
if z = —u* 
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Fig. 7. Illustration of projection hg [left, equation (9)] and conformal hg [center, equation 
(10)] compared to a tangent plane projection at the geodesic mean (right) of the data in 
Figure 2c. The hg maps the variation along <5i to the variation along the x-axis, while the 
tangent plane mapping fails to do so. 



scale parameter a of the function f u * to match the resulting total variance 
of /ij(xj) to the geodesic variance of Xj. 

In many cases, both projection and conformal /ij give better representa- 
tions than just using the tangent space. Figure 7 illustrates the image of 
hg with the toy data set depicted in Figure 2c. The tangent space map- 
ping is also plotted for comparison. The tangent space mapping leaves the 
curvy form of variation, while both hfs capture the variation and lead to 
an elliptical distribution of the transformed data. 

The choice between the projection and conformal mappings is a matter of 
philosophy. The image of the projection /ij is not all of M 2 , while the image 
of the conformal hg is all of M 2 . However, in order to cover M? completely, 
the conformal hg can grossly distort the covariance structure of the data. 
In particular, the data points that are far from u are sometimes overly 
diffused when the conformal hg is used, as can be seen in the left tail of 
the conformal mapped image in Figure 7. The projection hg does not suffer 
from this problem. Moreover, the interpretation of projection hg is closely 
related to the circle class model. Therefore, we recommend the projection 
hg, which is used in the following data analysis. 

5. Application to m-rep data. In this section an application of Principal 
Arc Analysis to the medial representation (m-rep) data example, introduced 
below in more detail, is described. 

5.1. Medial representation. The m-rep gives an efficient way of represent- 
ing 2- or 3-dimensional objects. The m-rep is constructed from the medial 
axis, which is a means of representing the middle of geometric objects. The 
medial axis of a 3-dimensional object is formed by the centers of all spheres 
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Fig. 8. (Left) An m-rep model with 3x5 grids of medial atoms. Each atom has its 
location (R 3 ) and two equal-length spokes (BL+ ® S 2 ® S 2 ). (Right) The implied surface of 
the m-rep model, showing a prostate shape. 

that are interior to objects and tangent to the object boundary at two or 
more points. In addition, the medial description is defined by the centers of 
the inscribed spheres and by the associated vectors, called spokes, from the 
sphere center to the two respective tangent points on the object boundary. 
The medial axis is sampled over an approximately regular lattice and the 
elements of the lattice are called medial atoms. A medial atom consists of 
the location of the atom combined with two equal-length spokes, defined as 
a 4-tuple: 

• location in R 3 ; 

• spoke direction 1, in S* 2 ; 

• spoke direction 2, in S 2 ; 

• common spoke length in M + ; 

as shown in Figure 8. The size of the regular lattice is fixed for each object 
in practice. For example, the shape of a prostate is usually described by a 
3x5 grid of medial atoms, across all samples. The collection of the medial 
atoms is called the medial representation (m-rep). An m-rep corresponds 
to a particular shape of prostate, and is a point in the m-rep space M. 
The space of prostate m-reps is then M = (K 3 ® R+ <g> S 2 <g> S 2 ) 15 , which 
is a 120-dimensional direct product manifold with 60 components. The m- 
rep model provides a useful framework for describing shape variability in 
intuitive terms. See Siddiqi and Pizer (2008) and Pizer et al. (2003) for 
detailed introduction to and discussion of this subject. 

An important topic in medical imaging is developing segmentation meth- 
ods of 3D objects from CT images; see Cootes and Taylor (2001) and Pizer 
et al. (2007). A popular approach is similar to a Bayesian estimation scheme, 
where the knowledge of anatomic geometries is used (as a prior) together 
with a measure of how the segmentation matches the image (as a likeli- 
hood). A prior probability distribution is modeled using m-reps as a means 
of measuring geometric atypicality of a segmented object. PC A- like methods 
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(including PAA) can be used to reduce the dimensionality of such a model. 
A detailed description can be found in Pizer et al. (2007). 

5.2. Simulated m-rep object. The data set partly plotted in Figure 1 
is from the generator discussed in Jeong et al. (2008). It generates random 
samples of objects whose shape changes and motions are physically modeled 
(with some randomness) by anatomical knowledge of the bladder, prostate 
and rectum in the male pelvis. Jeong et al. have proposed and used the 
generator to estimate the probability distribution model of shapes of human 
organs. 

In the data set of 60 samples of prostate m-reps we studied, the major 
motion of prostate is a rotation. In some S 2 components, the variation cor- 
responding to the rotation is along a small circle. Therefore, PAA should fit 
better for this type of data than principal geodesies. To make this advantage 
more clear, we also show results from a data set by removing the location 
and the spoke length information from the m-reps, the sample space of which 
is then {S 2 } 30 . 

We have applied PAA as described in the previous section. The ratios fi/a, 
estimated for the 30 S 2 components, are in general large (with minimum 
21.2, median 44.1 and maximum 118), which suggests use of small circles to 
capture the variation. 

Figure 9 shows the proportion of the cumulative variances, as a function 
of number of components, from the Principal Geodesic Analysis (PGA) of 
Fletcher et al. (2004) and PAA. In both cases, the first principal arc leaves 
smaller residuals than the first principal geodesic. What is more important 
is illustrated in the scatterplots of the data projected onto the first two 
principal components. The quadratic form of variation that requires two 
PGA components is captured by a single PAA component. 

The probability distribution model estimated by principal geodesies is 
qualitatively different from the distribution estimated by PAA. Although 
the difference in the proportion of variance captured is small, the resulting 
distribution from PAA is no longer elliptical. In this sense, PAA gives a 
convenient way to describe a nonelliptical distribution by, for example, a 
Normal density. 

5.3. Prostate m-reps from real patients. We also have applied PAA to a 
prostate m-rep data set from real CT images. Our data consist of five pa- 
tients' image sets, each of which is a series of CT scans containing prostate 
taken during a series of radiotherapy treatments [Merck (2008)]. The prostate 
in each image is manually segmented by experts and an m-rep model is fit- 
ted. The patients, coded as 3106, 3107, 3109, 3112 and 3115, have different 
numbers of CT scans (17, 12, 18, 16 and 15, respectively). We have in total 
78 m-reps. 
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Fig. 9. (Top) The proportion of variances captured by the first few components of PA A 
are compared to those from PGA for the simulated prostate m-reps. (Bottom) Scatter plots 
of the data on {S 2 } 30 show that the major variation is explained more concisely by the 
first principal arc. 



The proportion of variation captured in the first principal arc is 40.89%, 
slightly higher than the 40.53% of the first principal geodesic. Also note that 
the estimated probability distribution model from PAA is different from that 
of PGA. In particular, PAA gives a better separation of patients in the first 
two components, as depicted in the scatter plots (Figure 10). 

6. Doubly iterative algorithm to find the least-squares small circle. We 

propose an algorithm to fit the least-squares small circle (1), which is a con- 
strained nonlinear minimization problem. This algorithm is best understood 
in two iterative steps: The outer loop approximates the sphere by a tangent 
space; the inner loop solves an optimization problem in the linear space, 
which is much easier than solving (1) directly. In more detail, the (k + l)th 
iteration works as follows. The sphere is approximated by a tangent plane 
at Cfc, the kth. solution of the center of the small circle. For the points on the 
tangent plane, any iterative algorithm to find a least-squares circle can be 
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Fig. 10. The scatter plots of the real prostate ra-rep data. Different symbols represent 
different patients. PAA (right) gives a better separation of different patients in the first 
two components compared to PGA (left). 



applied as an inner loop. The solution of the inner iteration is mapped back 
to the sphere and becomes the {k + l)th input of the outer loop operation. 
One advantage of this algorithm lies in the reduced difficulty of the optimiza- 
tion task. The inner loop problem is much simpler than (1) and the outer 
loop is calculated by a closed-form equation, which leads to a stable and fast 
algorithm. Another advantage can be obtained by using the exponential map 
and log map (12) for the tangent projection, since they preserve the distance 
from the point of tangency to the others, that is, p(x, c) = || Log c (x) || for any 
x6S 2 . This is also true for radii of circles. The exponential map transforms 
a circle in M 2 centered at the origin with radius r to 5(c,r). Thus, whenever 
(1) reaches its minimum, the algorithm does not alter the solution. 

We first illustrate necessary building blocks of the algorithm. A tangent 
plane T c at c can be defined for any c in S 2 , and an appropriate coordinate 
system of T c is obtained as follows. Basically, any two orthogonal comple- 
ments of the direction c can be used as coordinates of T c . For example, 
when c = (0,0, 1)' = e3, a coordinate system is given by ei and &2- For a 
general c, let q c be a rotation operator on M 3 that maps c to e3. Then a 
coordinate system for T c is given by the inverse of q c applied to ei and e2, 
which is equivalent to applying q c to each point of S 2 and using ei, &2 as 
coordinates. 

The rotation operator q c can be represented by a rotation matrix. For 
c = (c x ,c y ,c z )', the rotation q c is equivalent to rotation through the angle 
9 = cos~ 1 (c z ) about the axis u = (c y , —c x ,0)'/ \J 1 — c 2 , whenever c ^ ±63. 
When c = ±e3, u is set to be ex- It is well known that a rotation matrix 
with axis u = (u x ,u y ,u z )' and angle 9 in radians is, for c = cos(f5), s = sin(0) 
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and v = 1 — cos(0), 

(c + u 2 v u x u y v — u z s u x u z v + u y s\ 

U x U y V + U z S C + UyV u y u z v-u x s , 

u x u z v — u y s u y u z v + u x s c + u 2 v J 

so that (?c( x ) = -Rc x ) for x € R 3 . 

With the coordinate system for T c , we shall define the exponential map 
Exp c , a mapping from T c to S 2 , and the log map Log c = Exp" 1 . These are 
defined for v = (v\,V2) € M 2 and x = (xi,X2, X3)' € S 2 , as 

(12) Exp c (v) =g c oExp e3 (v), Log c (x) = Log e3 o g c (x), 

for = cos (0:3). See (18)-(19) in the Appendix for Exp e3 and Log e3 . Note 
that Log c (c) = and Log c is not defined for the antipodal point of c. 

Once we have approximated each Xj by Log c (xj) = Xj, the inner loop finds 
the minimizer (v, r) of 

n 

(13) min^(||xj - v|| - r) 2 , 

i=i 

which is to find the least-squares circle centered at v with radius r. The 
general circle fitting problem is discussed in, for example, Umbach and Jones 
(2003) and Chernov (2010). This problem is much simpler than (1) because 
it is an unconstrained problem and the number of parameters to optimize is 
decreased by 1. Moreover, optimal solution of r is easily found as 

1 n 

(14) f=- V|| Xi -v||, 

n 

i=l 

when v is given. Note that for great circle fitting, we can simply put f = 
7r/2. Although the problem is still nonlinear, one can use any optimization 
method that solves nonlinear least squares problems. We use the Levenberg- 
Marquardt algorithm, modified by Fletcher (1971) [see Chapter 4 of Scales 
(1985) and Chapter 3 of Bates and Watts (1988)], to minimize (13) with r 
replaced by f. One can always use v = as an initial guess since = Log c (c) 
is the solution from the previous (outer) iteration. 
The algorithm is now summarized as follows: 

(1) Given {xi,...,x n }, c = x x . 

(2) Given c^, find a minimizer v of (13) with r replaced by (14), with inputs 
x ; = L °gc fc ( x i)- 

(3) If ||v|| < e, then iteration stops with the solution c = c^, r = f as in (14). 
Otherwise, c^ +1 =Exp Cfe (v) and go to step 2. 
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Note that the radius of the fitted circle in T c is the same as the radius of 
the resulting small circle. There could be many variations of this algorithm: 
as an instance, one can elaborate the initial value selection by using the 
eigenvector of the sample covariance matrix of Xj's, corresponding to the 
smallest eigenvalue as done in Gray, Geiser and Geiser (1980). Experience 
has shown that the proposed algorithm is stable and speedy enough. Gray, 
Geiser and Geiser proposed to solve (1) directly, which seems to be unstable 
in some cases. 

The idea of the doubly iterative algorithm can be applied to other opti- 
mization problems on manifolds. For example, the geodesic mean is also a 
solution of a nonlinear minimization, where the nonlinearity comes from the 
use of the geodesic distance. This can be easily solved by an iterative approx- 
imation of the manifold to a linear space [See Chapter 4 of Fletcher (2004)] , 
which is the same as the gradient descent algorithms [Pennec (2006), Le 
(2001)]. Note that the proposed algorithm, like other iterative algorithms, 
only finds one solution even if there are multiple solutions. 

APPENDIX: SOME BACKGROUND ON DIRECT PRODUCT 

MANIFOLD 

We give some necessary geometric background on direct product mani- 
folds. Precise definitions and geometric discussions on a richer class of man- 
ifold, Riemannian manifold, can be found in Boothby (1986) and Helgason 



A d-dimensional manifold can be thought of as a curved surface embedded 
in a Euclidean space of higher dimension d! (>d). The manifold is required 
to be smooth, that is, infinitely differ entiable, so that a sufficiently small 
neighborhood of any point on the manifold can be well approximated by 
a linear space. The tangent space at a point p of a manifold M, T p M, is 
defined as a linear space of dimension d which is tangent to M at p. The 
notion of distance on M is handled by a Riemannian metric, which is a metric 
of tangent spaces. In particular, the geodesic distance function pm(p,q) is 
roughly defined as the length of the shortest curve joining p,qEM. 

Now consider a direct product manifold M = Mi <g> ■ ■ • <g> M m . We restrict 
our attention to the case where each Mj is one of the simple manifolds 
S 1 ,S 2 ,M. + or R, but most of the assertions below apply equally well to 
direct products of more general manifolds. 

Geodesic distance function. The geodesic distance between p = (p , . . . ,p m ) 
and q = (q , . . . , q m ) is defined by 



(2001). 
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where each pm { is the geodesic distance function on M\. The geodesic dis- 
tance on S 1 is defined by the length of the shortest arc. Similarly, the 
geodesic distance on S 2 is defined by the length of the shortest great circle 
segment. The geodesic distance on R + needs special treatment. In many 
practical applications, R+ represents a space of scale parameters. A desir- 
able property for a metric on scale parameters is scale invariance, p(rx, ry) = 
p(x,y) for any x,y,r G R+. This can be achieved by differencing the logs, 
that is, 



(15) PR+(x,y) 



log- 

y 



for x, y £ ' 



Finally, the geodesic distance on a simple manifold R or R d is the Euclidean 
distance. 

Geodesic mean and variance. The geodesic mean of a set of points in M, 
also referred to as the intrinsic mean, is also calculated component-wise. 
The geodesic mean of xi, . . . ,x n € M is the minimizer in M of the sum of 
squared geodesic distances to the data. Thus, the geodesic mean is defined 
as 

1 - 

(16) x = argmin-y^p| // (x,x i ). 

In fact, each x l of x = {x , ...,x m ) is the geodesic mean of x\,...,x l n € 
Mi. The geodesic mean of 9±, . . . , 9 n € [0, 27r) = 5 1 is found by examining a 
candidate set consisting of 

(17) g SE*ft + 2jtt J = ,...,n-1, 

n 

as in Moakher (2002). The geodesic mean in 5 2 can be calculated by a full 
grid search or an iterative algorithm, as described in Section 6. The geodesic 
mean in M + or R is straightforward. Note that the geodesic mean may not 
be unique. However, throughout the paper, we have assumed that the data 
have a unique geodesic mean which is true in most data analytic situations. 
Statistical investigation of the geodesic mean on manifolds can be found, for 
example, in Bhattacharya and Patrangenaru (2003, 2005) and Le and Kume 
(2000). 

A related notion is geodesic variance. A sample geodesic variance is de- 
fined by the average squared geodesic distances to the geodesic mean, that is, 
h Y!i=i Pm(%i x i)- When M is indeed the Euclidean space, the geodesic vari- 
ance is the same as the total variance (the trace of the variance-covariance 
matrix) . 

Mappings to tangent space. The exponential map maps a point in T p M to 
M. The log map is the inverse exponential map whose domain is in M. For a 
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direct product manifold M, the mappings are also defined component-wise. 
For S 1 , let 6 R denote an element of T p S l where p is set to be (1,0) € S 1 
embedded in R 2 . Then the exponential map is defined as 

Exp p (<9) = (cos sin (9). 

The corresponding log map of x = (xi,X2) is defined as Log p (x) = sign(x2) • 
arccos(xi). For S 2 , let v= (v\,V2) denote a tangent vector in T P S 2 . Let 
p = e3, then the exponential map Exp p : T p S 2 — > S 2 is defined by 

(18) Exp (v) = ( -Tj-~]7 sin || v || , y~~u s ^ n II V H' cos II v l 



This equation can be understood as a rotation of the base point p to the 
direction of v with angle ||v||. The corresponding log map for a point x = 

(x 1 ,x 2 ,x 3 ) 6 S 2 is given by 

(!9) Log (x) = (xi-^—,x 2 -?- 

v \ sm.0 sm( 

where 6 = arccos(x3) is the geodesic distance from p to x. Note that the 
antipodal point of p is not in the domain of the log map, that is, the domain 
of Log is S 2 /{— p}. In both 5 1 and S 2 , p can be set to be any point on 
S 1 or S 2 . The exponential and log map for an arbitrary T p can be defined 
together with a rotation operator, which is defined in Section 6 for the case 
of S 2 . The exponential map of R+ is defined by the standard real exponential 
function. The domain of the inverse exponential map, the log map, is IR+ 
itself. Finally, the exponential map on R is the identity map. 
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